Summary

This report introduces a general model of parallel performance. With the goal of developing
conceptual and empirical methods for characterizing and understanding parallel algorithms,
new definitions of speedup and efficiency have been formulated. These definitions take
into account the effects that problem size and the number of processors have on efficiency
and speedup, and provide a natural and quantifiable measure of parallel performance. The
terms introduced in the definitions provide new and improved interpretations of the “serial”
and “parallel” fraction parameters commonly used in the literature (i.e., Amdahl’s Law) to
explain the behavior of parallel algorithms. The model provides a more complete
characterization of parallel algorithm behavior and is used to correct apparent deficiencies
in the formulation of speedup as expressed by Amdahl’s law.

1. Introduction

The field of parallel computing will become an established discipline when cost-effective
high-performance parallel algorithms can be routinely implemented based on natural
parallel programming models. An important issue to be resolved in the field is how to
effectively analyze the performance of parallel algorithms. With the goal of developing
analytical methods for characterizing and understanding parallel algorithm implementations,
this report proposes new definitions of speedup and efficiency. These definitions provide a
more natural, quantifiable, and multifaceted measure of the performance of parallel
algorithms than current models such as Amdahl’s law (Ref. 1) offer. In addition, the new
parameters introduced in our definitions of speedup and efficiency provide new and improved
interpretations of the “serial” and “parallel” fraction quantities frequently used in the
literature to explain the behavior of parallel algorithms. These definitions take into
account the effects that both the problem size and the number of processors have on
efficiency and speedup and allow us to formulate and prove a number of basic laws which form
the basis of this model. The model provides the foundation for future theoretical and
empirical studies which will contribute to a deeper understanding of parallel algorithms.

Chapter 2 of this report reviews the basic definitions of speedup and efficiency, introduces
new definitions of these quantities in terms of work units, and presents two illustrative
examples. Chapter 3 discusses two models of parallel performance which appear in the
parallel computing literature. The two formulations of speedup and efficiency are based on
the notion of “serial” and “parallel” fractions and present an apparent dichotomy which must
be resolved. Then, the “serial” and “parallel” fractions are expressed in terms of the new
work quantities introduced in Chapter 2, providing a natural interpretation of the serial and
parallel fractions of a task. The resulting equations are used to reconcile the (apparent)
dichotomy between Amdahl’s law and the more recent formulation of speedup proposed in
Ref. 2. Several fundamental laws involving the relationships between the serial and parallel
fractions of a task are derived, and numerical evidence regarding the behavior of the serial
and parallel fractions as the problem and ensemble size vary is presented. These
observations are incorporated into an idealized and a general model of parallel performance.
Finally, Chapter 4 discusses conclusions and directions for future work.




AFWL-TR-89-01 


2.  Background 


2.1  Measures  of  Performance 

Speedup has been advocated as the primary measure of parallel algorithm performance.
Intuitively, the speedup, S, is defined as the relative increase in speed over the serial
computation as processing elements are added to the parallel computation of a given work
load. Formally, this can be stated as

$$S(n_p) = \frac{T(1)}{T(n_p)} \tag{1}$$


where T(np) is the time expended performing a task using np processors. Since parallel
implementations may introduce computations which are unnecessary with respect to serial
implementations, T(1) is the time required to execute the task on a single processor using
the “best” serial implementation. Clearly, the time required to execute a task depends on
the number of operations that need to be performed, which, in turn, is a function of the
problem size. Therefore, a more complete formulation of speedup needs to take into account
the size of the problem. An alternate formulation of speedup is

$$S(n, n_p) = \frac{T(n, 1)}{T(n, n_p)} \tag{2}$$


where  n  denotes  the  problem  size  (this  view  is  also  formulated  in  Ref.  3).  Ideally,  one 
expects  S  to  equal  np;  that  is,  as  processing  elements  are  added,  speedup  should  increase 
at  the  same  rate.  However,  this  is  seldom  the  case  because  of  overhead  introduced  by  the 
parallel  implementation  (Ref.  4).  Therefore,  efficiency  is  often  used  to  measure  the  optimality 
of  the  speedup.  Efficiency  is  expressed  as 


$$E = \frac{S}{n_p} \tag{3}$$

Because efficiency normalizes on the number of processors used, it is a more intrinsic measure
of  an  implementation’s  parallel  performance  than  speedup.  A  more  important  statistic  than 
speedup,  perhaps,  is  the  cost  of  attaining  a  given  speedup  which  is  measured  by  the  efficiency 
of  the  implementation.  The  continued  reliance  on  speedup  as  the  primary  measure  of  parallel 
performance  may  be  attributed  to  the  unit  of  measurement  that  has  been  adopted:  execution 
time.  Using  time  as  a  measure  of  work  has  several  drawbacks.  First,  it  varies  with  the 
computer  used.  Second,  it  is  simply  a  statistic  which  does  not  provide  any  particular  insight 
about  the  algorithm.  We  need  to  describe  the  efficiency  of  a  parallel  implementation  in  terms 
of  a  measure  of  work  that  is  quantifiable  and  useful  for  interpretations  of  the  observable 
behavior  of  the  implementation. 
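
The conventional, time-based definitions in Eqs. 1 and 3 can be sketched in a few lines. This is only an illustration: the function names and the sample timings are assumptions made for the example, and exact fractions are used so that the ratios are not blurred by rounding.

```python
from fractions import Fraction

def speedup(t_serial, t_parallel):
    """Eq. 1: S = T(1) / T(np), with both times in the same unit."""
    return Fraction(t_serial, t_parallel)

def efficiency(t_serial, t_parallel, n_procs):
    """Eq. 3: E = S / np."""
    return speedup(t_serial, t_parallel) / n_procs

# Illustrative timings (arbitrary time units) keyed by processor count.
times = {1: 9, 2: 6, 4: 4}
S = {n: speedup(times[1], t) for n, t in times.items()}
E = {n: efficiency(times[1], t, n) for n, t in times.items()}
```

Because both functions take only measured times, they say nothing about where the lost efficiency goes, which is the limitation the work-based definitions are meant to address.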


2.2  Alternate  Definition  of  Speedup  and  Efficiency 

An alternative measure of performance is provided by computational counts or unit counts
based on the size of an indivisible task. This same measure of work has been used to compare
serial implementations of an algorithm with order, growth rate, and complexity analysis
functions such as Θ, Ω, and Big O. In these terms, the efficiency of a parallel algorithm can
be defined as

$$E = \frac{w_a}{w_e} \tag{4}$$

where wa is the work accomplished, we is the work expended, and ww = we - wa is the
work wasted. For a given implementation, the work wasted accounts for the time expended
in the following activities:


•  Waiting for other tasks to complete work

•  Communication delays and/or memory contentions associated with a particular computer
architecture and an implementation’s communication load

•  Operation redundancies introduced by the particular parallel implementation, including
task activation/termination overhead

Work accomplished, wa, is a function of the problem size but is independent of the number of
processors used since it refers to the amount of work that a single processor would accomplish.
In terms of units of measurement, wa equals the number of operations performed by the
“best” serial implementation of the algorithm. On the other hand, work wasted is a function
of problem size and the number of processors used since, for example, problem size affects the
number of redundant operations introduced, and ensemble size affects the waiting behavior
of the implementation. Therefore, Eq. 4 can be restated as


$$E = \frac{w_a(n)}{w_w(n, n_p) + w_a(n)} \tag{5}$$

and speedup can be defined in terms of efficiency as

$$S = E \, n_p = \frac{w_a(n)}{w_w(n, n_p) + w_a(n)} \, n_p \tag{6}$$


A similar formulation is introduced in Ref. 4 where efficiency is defined as a function of fc,
the ratio between communication and computation loads. In fact, Eq. 5 can be rewritten in
the form:

$$E = \frac{1}{(w_w/w_a) + 1} \tag{7}$$

which looks very much like the formulation in Ref. 4 with fc = ww/wa. The difference,
however,  is  of  practical  concern,  since  the  formulation  in  Ref.  4  only  considers  time  wasted 
due  to  message  delays,  while  the  new  definitions  take  into  account  the  effects  of  load  balance, 
task  overhead,  and  operation  redundancies.  These  formulations  of  efficiency  and  speedup 
parameterize  algorithms  and  clearly  demonstrate  the  effect  of  the  individual  work  parameters 
on  the  overall  value  of  the  quantities.  As  such,  the  formulations  may  be  better  suited  for 
comparative  studies  of  parallel  implementations. 

Figure 1. Task dependency and scheduling charts for Example 1. (a) Task dependency graph.
(b) Scheduling chart, annotated with T(1) = 9, S = 1, E = 1; T(2) = 6, S = 9/6, E = 3/4;
T(3) = 4, S = 9/4, E = 3/4; T(4) = 4, S = 9/4, E = 9/16.



To  show  the  applicability  of  these  new  definitions,  two  examples  are  presented.  The  first 
example  is  a  simple  artificial  problem  used  to  introduce  some  general  concepts.  The  second 
example  is  based  on  a  parallel  implementation  of  a  Cholesky  matrix  factorization  algorithm. 


2.2.1  Case  Study  1 

Figure 1a shows a directed acyclic graph which represents a possible computational task.
Nodes in the graph represent indivisible tasks, and the arcs represent task dependencies. For
example, task b cannot start until task a is accomplished. Figure 1b shows the scheduling
of independent tasks on 1, 2, 3, and 4 processors. For simplicity, each task is assumed to
accomplish the same amount of work and no time is wasted due to communication delays or
operation redundancies.

The multitasked implementations indicate portions of wasted effort, ww, due to task
dependencies as unlabeled segments. The speedup and efficiency of each implementation (shown
on the right of each graph) are computed using the conventional Eqs. 1 and 3. Table 1 shows
speedup and efficiency computed using the new definitions as expressed by Eqs. 5 and 6. As
can be seen, the speedup and efficiency computed by the different definitions coincide.

Table 1. Speedup and efficiency for Example 1.

np    ww    wa    we    E = wa/we    S = E*np
1     0     9     9     1            1
2     3     9     12    3/4          3/2
3     3     9     12    3/4          9/4
4     7     9     16    9/16         9/4
5     11    9     20    9/20         9/4

Note  that  a  maximum  speedup  of  9/4  is  reached  with  three  processors  and  that  adding 
processors  only  increases  the  work  wasted  which  simply  decreases  the  efficiency.  Also,  note 
that  the  maximum  speedup  of  9/4  equates  to  the  reciprocal  of  the  ratio  of  the  length  of 
the  longest  dependency  chain  in  the  dependency  graph  (4)  to  the  total  number  of  tasks  (9). 
Chapter  3  shows  that  this  ratio  represents  the  speedup  bound  predicted  by  Amdahl’s  law. 
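
The agreement between the time-based and work-based definitions for Example 1 can be checked mechanically. The sketch below assumes that the work expended is np times the schedule length, that is, every processor is charged for the whole schedule whether busy or idle; this assumption reproduces the we column of Table 1 but is not stated explicitly in the text. The schedule length for five processors is inferred from Table 1 (we = 20), since Fig. 1 only shows up to four.

```python
from fractions import Fraction

W_A = 9  # work accomplished: nine unit tasks in Example 1
SCHEDULE_LENGTH = {1: 9, 2: 6, 3: 4, 4: 4, 5: 4}  # T(np), in unit-task times

def work_based(n_procs):
    """Return (ww, we, E, S) using Eqs. 5 and 6."""
    w_e = n_procs * SCHEDULE_LENGTH[n_procs]  # assumption: idle slots count as expended work
    w_w = w_e - W_A
    eff = Fraction(W_A, w_w + W_A)
    return w_w, w_e, eff, eff * n_procs

def time_based(n_procs):
    """Return (E, S) using Eqs. 1 and 3."""
    s = Fraction(SCHEDULE_LENGTH[1], SCHEDULE_LENGTH[n_procs])
    return s / n_procs, s

rows = {n: work_based(n) for n in SCHEDULE_LENGTH}
best = max(s for (_, _, _, s) in rows.values())
```

Here `best` is the largest speedup over the listed processor counts, which can be compared with the ratio of total tasks (9) to the longest dependency chain (4).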

2.2.2  Case  Study  2 

A common algorithm applied to structural analysis, hydrodynamics, and least-squares
problems is the Cholesky factorization of an nth-order square matrix. The general solution of
the Cholesky factorization is a specialized form of the more common Gaussian elimination
algorithm (see Ref. 5). The Cholesky method takes advantage of symmetry properties of the
matrix to reduce the complexity of factorization from an $O(4n^3/3)$ problem as solved by the
Gaussian elimination method (Ref. 6) to an $O(n^3/3)$ problem (Ref. 7). Depending on how
the looping variables are arranged, there are six different implementations of this algorithm
(Ref. 8). However, the following discussion concentrates on the ijk form of Cholesky.

The algorithm is designed for a shared memory architecture and relies on a dynamic
scheduling scheme to assign rows of the matrix to available processors. That is, when a
processor becomes available, the next row to be computed is assigned to that processor. The
order in which rows are assigned to processors is contingent on (1) the data dependencies
associated with the algorithm, and (2) the manner in which the algorithm is implemented.

Figure 2. Cholesky data dependency chart.

The data dependencies for a 7 x 7 matrix are shown in Fig. 2. Each node represents the
ijth element of the resultant matrix. These data dependencies are inherent in the algorithm.
The synchronization scheme adopted by a given implementation must ensure that these
inherent dependencies are preserved.

The  ijk  form  of  Cholesky  solves  the  system  of  equations  row-wise.  Observe  that,  in  order 
to  factor  the  ith  row,  the  previous  i  —  1  rows  are  required.  More  specifically,  to  compute 
the  jth  element  of  any  row,  the  elements  of  the  jth  row  are  required.  Thus,  one  possible 
implementation  can  have  task  i  wait  for  the  jth  row  to  be  computed  before  the  ijth  element 
is  computed.  The  basic  algorithm  is  similar  to  the  implementation  of  the  column-Cholesky 
in  Ref.  9  and  is  as  follows: 


    for task i
        for j := 1 to i-1 do
            wait for jth row
            for k := 1 to j-1 do
                a(i,j) := a(i,j) - a(i,k)*a(j,k);
            end
            a(i,j) := a(i,j)/a(j,j);
            a(i,i) := a(i,i) - a(i,j)*a(i,j);
        end
        a(i,i) := sqrt(a(i,i));
    end

Table 2. Speedup and efficiency for Example 2.

np    ww     wa     we     E = wa/we    S = E*np
1     0      168    168    1            1
2     44     168    212    84/106       84/53
3     126    168    294    84/147       84/49
4     224    168    392    84/196       84/49
5     322    168    490    84/245       84/49


In order to create the scheduling chart for the Cholesky algorithm, assume that each additive
and multiplicative operation corresponds to one unit of work, while divide and square root
operations cost two work units each. Based on these assumptions, it takes 2j work units
to compute the jth element of any row. Scheduling charts for implementations on 1 to 4
processors are shown in Fig. 3 with their respective speedup and efficiency (shown on the
right of each graph) computed using Eqs. 1 and 3. The scheduling charts do not take into
account communication delays or time required for task switching. Table 2 shows speedup
and efficiency computed with the alternate definitions. Once again, speedup and efficiency
statistics computed by the new definitions and the standard definitions coincide.

One can generalize the effects n and np have on speedup and efficiency for the above
implementation by deriving closed-form equations which estimate wa and ww. A computational
count shows that there are $n^3/3 - n/3$ multiplications and additions and $(n^2 + n)/2$
divisions and square roots. Thus the work accomplished is $w_a = n^3/3 + n^2 + 2n/3$
computational units (division and square root operations count for two operations each). This
expression is an exact measure of work accomplished based on our assumptions.
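
The operation count can be cross-checked by walking the loop nest of the ijk algorithm and charging each operation its assumed cost. The following is a minimal sketch under the stated cost model (one unit per add, subtract, or multiply; two units per divide or square root); the function names are hypothetical and no matrix arithmetic is actually performed.

```python
def cholesky_work(n, cheap=1, costly=2):
    """Count work units for the row-oriented (ijk) loop nest on an n x n matrix."""
    units = 0
    for i in range(1, n + 1):
        for j in range(1, i):
            units += 2 * cheap * (j - 1)  # inner k loop: one multiply and one subtract per pass
            units += costly               # divide a(i,j) by a(j,j)
            units += 2 * cheap            # multiply and subtract that update a(i,i)
        units += costly                   # square root of a(i,i)
    return units

def closed_form(n):
    """wa = n^3/3 + n^2 + 2n/3, evaluated in exact integer arithmetic."""
    return (n**3 + 3 * n**2 + 2 * n) // 3
```

For the 7 x 7 case both functions give the wa value listed in Table 2, and the numerator n(n + 1)(n + 2) is always divisible by 3, so the integer division is exact.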

Deriving a closed-form equation for the work wasted, however, is not so straightforward.
The effect of problem and ensemble size on the waiting behavior of an implementation can be
somewhat difficult to measure. To estimate wasted work, a simulation based on the same
assumptions made above concerning the cost of each operation was performed. Based on
the simulation, a statistical analysis showed that for n > 50, ww can be approximated by
the expression


ww 


n‘ 


(38  4-  n/10)np  +  — —  149 


{np  -  1) 


Figure 4a shows a graph of simulated work wasted, while Fig. 4b shows simulated speedup
and estimated speedup using the approximations of wa and ww. Based on these
approximations, several observations can be made. First, as the size of the problem increases,
speedup increases for a given ensemble size. This is due to an increase in efficiency as problem
size increases. Second, as the number of processors increases, efficiency decreases for a given
problem size. Figures 5a and 5b are graphs of constant speedup and efficiency curves based
on our estimates of ww and wa. These curves show that a given speedup can be attained
from different combinations of problem and ensemble size.

As can be seen, the waiting behavior of an implementation can be determined analytically by
using closed-form expressions of the new fundamental parameters. In what follows, the
relationship between the new definitions and other, better-known definitions of speedup
and efficiency is explored.

Figure 5. Constant speedup and efficiency curves.

3. Models of Parallel Performance

3.1  Two  Views  of  Speedup 

The  parallel  computing  literature  contains  a  number  of  different  formulations  of  speedup  and 
efficiency  based  on  the  “serial”  and  “parallel”  fractions  of  a  given  task.  This  section  examines 
two  different  points  of  view  for  modeling  performance  of  a  parallel  system.  From  the  first 
perspective,  the  speedup  is  viewed  in  the  context  of  the  execution  time  of  the  serial  task  on 
multiple  processors.  From  the  second  perspective,  the  speedup  is  viewed  in  the  context  of 
the  execution  time  of  the  parallel  task  on  a  single  processor. 

The first point of view is expressed by Amdahl’s law (Ref. 1), which states that if s is the
fraction of time spent on the serial portion of a task and p is the fraction of time spent on
portions of the task that can be executed in parallel (i.e., s + p = 1), then the time required
for a parallel system to execute the task is (s + p/np) * T(1). Based on these definitions of
s and p, and using Eq. 1, speedup can be expressed as

$$S_{amdahl} = \frac{(s + p)\,T(1)}{(s + p/n_p)\,T(1)} = \frac{s + p}{s + p/n_p} = \frac{1}{s + p/n_p} \tag{8}$$

where speedup has been normalized with respect to time. Figure 6a shows the effect s has
on speedup for a given number of processors. The Amdahl point of view is that the serial
portion of an algorithm bounds the maximum speedup that can be attained as processing
elements are added to the computation. Therefore, as np → ∞, the term p/np → 0, so
speedup is asymptotically bounded by 1/s. This formulation also implies that for a given
problem size, the maximum speedup attainable is not reached until an infinite number of
processors are used.
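
The asymptotic bound can be seen numerically by evaluating Amdahl's expression for a fixed serial fraction and a growing processor count. This is a small sketch; the function name and the sample value s = 0.04 are chosen only for illustration.

```python
def amdahl_speedup(s, n_procs):
    """Amdahl's law with p = 1 - s: S = 1 / (s + p/np)."""
    return 1.0 / (s + (1.0 - s) / n_procs)

s = 0.04  # illustrative serial fraction
for n_procs in (4, 64, 1024, 1_000_000):
    # The printed speedup climbs toward, but never reaches, 1/s.
    print(n_procs, round(amdahl_speedup(s, n_procs), 2))
```

With s held constant, each additional processor contributes less than the one before it, which is why the curve flattens against the 1/s ceiling.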

The second point of view was recently formulated by researchers at Sandia National
Laboratories. In Ref. 2, Amdahl’s law is restated as follows: let s' be the portion of time spent
performing serial work on the parallel system executing the task, and let p' be the portion of
time spent performing parallel work on the parallel system (i.e., s' + p' = 1). Then the time
required by a single processor to perform the task is (s' + p'np) * T(np). Therefore, speedup
can be expressed as

$$S_{sandia} = \frac{(s' + p' n_p)\,T(n_p)}{(s' + p')\,T(n_p)} = \frac{s' + p' n_p}{s' + p'} = s' + p' n_p \tag{9}$$

Figure 6. Speedup given by Amdahl’s Law and by Sandia’s reformulation. (a) Amdahl’s Law.
(b) Sandia’s reformulation.
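
To get a feel for how differently the two formulations behave, Eqs. 8 and 9 can be evaluated side by side with the same numeric value plugged in as the serial fraction. The sketch below does this for 1024 processors; treating s and s' as the same number is exactly the naive reading that the rest of this chapter argues against, so the comparison is illustrative only, and the function names are hypothetical.

```python
def amdahl_speedup(s, n_procs):
    """Eq. 8 with p = 1 - s."""
    return 1.0 / (s + (1.0 - s) / n_procs)

def sandia_speedup(s_prime, n_procs):
    """Eq. 9 with p' = 1 - s'."""
    return s_prime + (1.0 - s_prime) * n_procs

N_PROCS = 1024
comparison = [
    (f, amdahl_speedup(f, N_PROCS), sandia_speedup(f, N_PROCS))
    for f in (0.01, 0.02, 0.03, 0.04)
]
ratios = [sandia / amdahl for (_, amdahl, sandia) in comparison]
```

Each entry of `ratios` is the factor by which the Sandia form exceeds the Amdahl form for the same numeric fraction.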


Figure 6b shows a sample graph of speedup as defined by Eq. 9 for np = 1024. The
apparent dichotomy between the two definitions of speedup is that S_amdahl predicts modest
upper bounds on speedups for serial fractions in the range 0.01 to 0.04, while S_sandia predicts
potential speedups of 10 to 40 times greater magnitude in the same range. Because both laws
are based on a common definition of speedup (Eq. 1), it is striking that the two formulations
are so distinct. However, if it is assumed that speedup definitions (Eqs. 6, 8, and 9)
are equivalent, the apparent dichotomy between the two points of view can be reconciled
by expressing the serial and parallel fractions in terms of the parameters ww, wa, we, and
the number of processors, np. In particular, it follows that the two different formulations
of speedup can be considered equivalent, assuming the fractional entities are expressed in
terms of the work parameters and np. Equating Eqs. 6 and 8, one obtains:

$$\begin{aligned}
\frac{1}{s + p/n_p} &= \frac{w_a}{w_a + w_w}\, n_p \\
n_p \left( s + \frac{1 - s}{n_p} \right) &= \frac{w_a + w_w}{w_a} \\
s\,n_p + 1 - s &= 1 + \frac{w_w}{w_a} \\
s &= \frac{w_w / w_a}{n_p - 1} \qquad (n_p \ge 2)
\end{aligned} \tag{10}$$


Therefore,  s  can  be  interpreted  as  the  distribution  across  the  additional  processors  of  the 
ratio  of  work  wasted  to  work  accomplished.  Equating  Eqs.  6  and  9,  one  obtains 


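$$\begin{aligned}
n_p \frac{w_a}{w_a + w_w} &= s' + (1 - s')\,n_p \\
n_p \frac{w_a}{w_a + w_w} &= s'(1 - n_p) + n_p \\
n_p \left( \frac{w_a}{w_a + w_w} - 1 \right) &= s'(1 - n_p) \\
-\,n_p \frac{w_w}{w_e} &= s'(1 - n_p) \\
s' &= \frac{w_w}{w_e} \cdot \frac{n_p}{n_p - 1} \qquad (n_p \ge 2)
\end{aligned} \tag{11}$$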


Therefore, s' can be interpreted as a collective wasted effort: np times the distribution
across the additional processors of the ratio of work wasted to work expended, that is,
np * (ww/we)/(np - 1). Notice that both interpretations of s and s' are undefined for np = 1.
This reflects the fact that it is inconsistent to consider serial and parallel portions of a
task in a strictly serial context; the fractions have meaning only in a parallel context. In
addition, since ww and wa are functions of n and np, it follows that the parameters s, s', p,
and p' are not only functions of the problem size, but also functions of the number of
processors used. Although this assertion seems intuitively inconsistent with the a priori
definitions of these parameters, it is based on both theoretical considerations and empirical
evidence.
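
The two interpretations can be exercised on concrete numbers. The sketch below computes s and s' from the work parameters, as the ratio of work wasted to work accomplished (respectively expended) spread over the additional processors, and then feeds them back into the Amdahl and Sandia speedup forms. Function names are hypothetical; exact fractions are used so the equalities can be tested without rounding.

```python
from fractions import Fraction

def serial_fractions(w_w, w_a, n_procs):
    """Return (s, s') from the work parameters; defined only for np >= 2."""
    if n_procs < 2:
        raise ValueError("s and s' are undefined for fewer than two processors")
    w_e = w_w + w_a
    s = Fraction(w_w, w_a) / (n_procs - 1)                          # Eq. 10
    s_prime = Fraction(w_w, w_e) * Fraction(n_procs, n_procs - 1)   # Eq. 11
    return s, s_prime

def amdahl(s, n_procs):
    return 1 / (s + (1 - s) / n_procs)

def sandia(s_prime, n_procs):
    return s_prime + (1 - s_prime) * n_procs

# Example 1 on four processors: ww = 7, wa = 9.
s, s_prime = serial_fractions(7, 9, 4)
work_speedup = Fraction(9, 7 + 9) * 4  # Eq. 6
```

With s and s' defined this way, both `amdahl(s, 4)` and `sandia(s_prime, 4)` return the same value as the work-based speedup, which is the sense in which the three definitions are equivalent.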
Figure 7. Box diagram of Amdahl’s serial and parallel fractions.

A useful diagram for understanding the meaning of s, s', p, and p' can be developed as follows:
based on Amdahl’s definition of s, there exists a portion of an algorithm that must be
executed serially and another portion that can be executed in parallel. For five processors
the idealized work distribution may have the form shown in Fig. 7, where the shaded region
indicates wasted work. A more realistic distribution of work for a multiprocessor system is
depicted in Fig. 8a where, strictly speaking, there is no exclusively serial portion of work.
Assume, for purposes of illustration, that in a general parallel system with np processors,
the wasted work throughout the total execution time is initially grouped on the last np - 1
processors during the time period 0 ≤ t ≤ t0 = ww/(np - 1). Furthermore, suppose that the
total work, we, is expended over the time period 0 ≤ t ≤ t1 = we/np. We can now interpret
the parameters s, s', p, and p' in terms of relative areas of the new diagram, Fig. 8b. For
example, s is the ratio of the width of t0 to the length of the total white area, wa, while s'
is the ratio of the width of t0 to the total width of the box, t1. Similar interpretations can
be given for p and p'. In addition, since the diagram is expressed in terms of ww, wa, and np,
one can derive Eqs. 10 and 11 from geometric considerations. For example, since s = t0/wa and
t0 = ww/(np - 1), then s = [ww/(np - 1)]/wa, which is the same interpretation of s given by
Eq. 10. The parameters s', p, and p' can be derived similarly. These graphical representations
of tasks provide insight about the properties of s, s', p, and p'. In this parallel framework
these parameters have real meaning. Amdahl’s s parameter has an averaging effect in that
it represents the portion of work which would have to be done 100 percent serially so that
a portion p of the work could theoretically be performed with all processors working at 100
percent efficiency. Generally, as the number of processors is increased, work wasted changes
and the fractions s and p change.

Figure 8. Realistic distribution of serial and parallel fractions.


3.2  Basic  Laws 

Using  the  expressions  for  s  and  s'  found  in  Section  3.1,  several  fundamental  laws  can  be 
derived  which  illustrate  the  relationships  between  the  respective  serial  and  parallel  fractions. 
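For convenience, let a = wa/we = E and let w = ww/we. Then Eqs. 10 and 11 can be
written in the following form:

$$s = \frac{w}{a} \cdot \frac{1}{n_p - 1} \qquad (n_p \ge 2) \tag{12}$$

$$s' = w\,\frac{n_p}{n_p - 1} \qquad (n_p \ge 2) \tag{13}$$

Using these equations, the following fundamental laws can be derived:

$$\text{Speedup} = \begin{cases} s'/s & \text{if } w_w > 0 \\ n_p & \text{if } w_w = 0 \end{cases}
\qquad
\text{Efficiency} = \begin{cases} p'/p & \text{if } w_w > 0 \\ 1 & \text{if } w_w = 0 \end{cases} \tag{14}$$

Proof: Using Eqs. 12 and 13, one obtains

$$\frac{s'}{s} = \frac{w\,n_p}{n_p - 1} \cdot \frac{a\,(n_p - 1)}{w} = a\,n_p = E\,n_p = S$$

Using the definitions of p and p', Eq. 11, and the fact that E = 1 - w, one derives

$$\begin{aligned}
p' &= 1 - s' \\
   &= (1 - w) - a \left[ \frac{w}{a} \left( \frac{1}{n_p - 1} \right) \right] \\
   &= E - E s \\
   &= E\,(1 - s) \\
   &= E p
\end{aligned}$$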

The basic law says that the serial fractions of a task, which have been used to define speedup,
are related precisely by that speedup! The law also states that the parallel fractions of a
task which have been used in defining speedup are related precisely by the efficiency of the
parallel system*. Also, since E ≤ 1 and S ≤ np, the above proof establishes the following
inequalities:

$$s' \le s\,n_p \tag{15}$$

$$p' \le p \tag{16}$$

*The basic laws can also be derived by expressing s and s' in terms of S in Eqs. 8 and 9 and
taking the corresponding ratios.
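
The basic law can be checked against the Cholesky example by computing the fractions from the work-wasted values listed in Table 2 and taking the two ratios. This is a sketch only; the dictionary and function names are hypothetical, and the entries for 2, 3, and 4 processors are copied from the table.

```python
from fractions import Fraction

W_A = 168                             # work accomplished for the 7 x 7 case
WASTED = {2: 44, 3: 126, 4: 224}      # ww per processor count, from Table 2

def fractions_for(n_procs):
    """Return (s, s', p, p') from the work parameters."""
    w_w = WASTED[n_procs]
    w_e = w_w + W_A
    s = Fraction(w_w, W_A) / (n_procs - 1)
    s_prime = Fraction(w_w, w_e) * Fraction(n_procs, n_procs - 1)
    return s, s_prime, 1 - s, 1 - s_prime

def basic_law(n_procs):
    """Return (speedup, efficiency) as the ratios s'/s and p'/p."""
    s, s_prime, p, p_prime = fractions_for(n_procs)
    return s_prime / s, p_prime / p
```

Comparing `basic_law(n)` with the E and S columns of Table 2 gives the same fractions, and the inequalities on s' and p' hold for each listed processor count.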
The preceding work leads to a more unified view of speedup and important reinterpretations
of the parameters affecting the parallel performance of algorithms. The unified view of
speedup is provided as follows: if the expressions for s and s' are substituted in Eqs. 8
and 9, respectively, the formulations of speedup in Ref. 1 and Ref. 2 coincide with the new
definition of speedup. In addition, using the basic law, the two speedup graphs shown in
Fig. 6 can be correlated. For purposes of illustration, assume speedup is a linear function
of s': S = -ms' + b, as proposed in Ref. 2. Substituting s' = sS and rearranging terms,
one establishes that speedup is a hyperbolic function of s: S = b/(1 + ms), as proposed in
Amdahl’s formulation of speedup. Therefore, the apparent dichotomy between the different
formulations of speedup is resolved. In addition, the relationships between s and s' are easily
derived from Eqs. 8, 9, and 14.

$$s' = \frac{s}{s + (1 - s)/n_p} \tag{17}$$

$$s = \frac{s'}{s' + (1 - s')\,n_p} \tag{18}$$
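
The conversion between the two serial fractions follows from the basic law s' = sS, with S taken from the Amdahl form in one direction and from the Sandia form in the other. The following sketch, with hypothetical function names, checks that the two conversions are inverses of each other on the Example 1 values.

```python
from fractions import Fraction

def to_s_prime(s, n_procs):
    """s' = s * S, with S = 1 / (s + (1 - s)/np) from Amdahl's form."""
    return s / (s + (1 - s) / n_procs)

def to_s(s_prime, n_procs):
    """s = s' / S, with S = s' + (1 - s') * np from the Sandia form."""
    return s_prime / (s_prime + (1 - s_prime) * n_procs)

# Example 1 on four processors: s = 7/27 and s' = 7/12.
s = Fraction(7, 27)
s_prime = to_s_prime(s, 4)
round_trip = to_s(s_prime, 4)
```

A small s therefore maps to a much larger s' once np is large, which is one way to read the gap between the two graphs in Fig. 6.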


In order for the different formulations of speedup to remain consistent, it is necessary to
treat the serial and parallel fraction parameters as dynamic entities with respect to np. If
one insists on viewing these fractions as constants for a given problem size, the formulations
of speedup found in Ref. 1 and Ref. 2 only describe the limiting behavior of a general
parallel system. Therefore, from this point on, the parameters s, s', p, and p' will be treated
as variables which depend on n and np.


3.3  Dynamics  of  s,p,  s'  and  p' 

This section illustrates the behavior of s and s' with respect to n and np. These observations
will support the assumptions made in the models of parallel performance presented in Sections
3.4 and 3.5. Tables 3 and 4 show sample values of the serial and parallel fractions as expressed
by Eqs. 10 and 11 for the examples presented in subsections 2.2.1 and 2.2.2 and show the
computation of speedup and efficiency based on the basic law expressed in Eq. 14.

Table 3. Dynamics of fundamental parameters - Example 1.

np    s        s'       S = s'/s    p        p'       E = p'/p
2     0.333    0.500    1.50        0.667    0.500    0.750
3     0.167    0.375    2.25        0.833    0.625    0.750
4     0.259    0.583    2.25        0.741    0.417    0.563
5     0.306    0.688    2.25        0.694    0.313    0.450

Table 4. Dynamics of fundamental parameters - Example 2.

np    s        s'       S = s'/s    p        p'       E = p'/p
2     0.262    0.415    1.585       0.738    0.585    0.792
3     0.375    0.643    1.714       0.625    0.357    0.571
4     0.444    0.762    1.714       0.556    0.238    0.429
5     0.479    0.821    1.714       0.521    0.179    0.343

Notice that even though the maximum value of speedup is attained, the values of s and s'
(and therefore p and p') continue to change as np increases. This occurs because the work
wasted increases as np increases. A straightforward analysis of Example 1 shows that for
np ≥ 3, ww = 4np - 9 and we = (4np - 9) + wa = 4np. Therefore, s = (4np/9 - 1)/(np - 1)
and s' = (np - 9/4)/(np - 1) for np ≥ 3. It follows that

$$\lim_{n_p \to \infty} s = 4/9 = 1/\text{Max Speedup}$$

and

$$\lim_{n_p \to \infty} s' = 1$$
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Figure  9.  Serial  Fraction,  s,  as  a  Function  of  rip. 

Based on the simulation described in subsection 2.2.2, the s(n, np) curves shown in Fig. 9
were obtained. These curves illustrate the following facts about the simulation of
the Cholesky algorithm. First, as problem size increases, the respective s values decrease (fixed
np). Second, as the number of processors increases, the s values increase (fixed n). Third,
the points of inflection of the curves correspond to the number of processors at which the
maximum speedup is attained. This behavior is also shown by the tables found in the
appendix, which were computed from the timing data found in Ref. 2 and Ref. 3 for certain
parallel numerical algorithms implemented on a 1024-node hypercube.
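
Entries of the kind tabulated in the appendix can be derived from a measured speedup alone. The following is a minimal sketch, assuming the Amdahl-type relation of Eq. 8 solved for s, s = (np/S - 1)/(np - 1), and the basic law s' = S s. The function name and dictionaries are illustrative; the sample speedups are those printed for the largest problem size in Table A.4.

```python
def serial_fractions(speedup, n_p):
    """Return (s, s') for a measured speedup on n_p processors."""
    s = (n_p / speedup - 1.0) / (n_p - 1.0)   # s written in terms of speedup
    s_prime = speedup * s                      # basic law s' = S s
    return s, s_prime


# Speedups for the largest problem size in Table A.4, keyed by processor count.
row_a4 = {4: 2.606, 16: 10.007, 64: 32.468, 256: 61.040, 1024: 64.936}
fractions_a4 = {n_p: serial_fractions(S, n_p) for n_p, S in row_a4.items()}
for n_p, (s, sp) in fractions_a4.items():
    print(f"n_p={n_p:5d}  s={s:.5f}  s'={sp:.5f}")
```

Under these assumptions the computed s' values match the printed s' row of Table A.4 to about three decimals; small differences are expected because the printed speedups are themselves rounded.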

3.4  An Idealized Model of Parallel Performance

The preceding theoretical and empirical examples suggest certain characteristic behavior of
the parameters s, p, s', and p'. For example, for a fixed number of processors np, as the
problem size increases, Tables 3, 4, and those in the appendix show that s decreases and S
increases. Also, for a fixed n, as np increases, the tables seem to show that s' approaches 1.
Finally, for a given n, the tables seem to indicate that the values of s fluctuate until a certain
np is reached, at which point s begins to increase to an asymptotic limit. These observations,
in connection with the laws presented in Section 3.2, form the basis for the model of the
performance of parallel algorithms that will be presented in the next section. However, before
presenting the model, the limiting or asymptotic behavior of the fundamental parameters
will be discussed.

The effect of scheduling tasks to processors can be thought of as follows: as processors are
added to the computation, the original serial task is partitioned into smaller and smaller
independent portions while preserving the inherently sequential portions of the algorithm.
There is a point where further partitioning is not possible without conflicting with the
inherent dependencies. Thus, it is useful to base the discussion of asymptotic behavior on
the fundamental assumption that there is a largest portion of a task, h, which is indivisible
(in the parallel sense)*. For instance, in Example 2.2.1, h is the ratio of the depth of the
dependency graph (d) to the total number of atomic tasks (nodes in the graph) (A), so
h = d/A = 4/9. A so-called “embarrassingly parallel” or purely parallel algorithm may have
d = 1, the time required to execute a single computation or instruction. In terms of the
box diagrams used in Section 3.1, the width of a box will never be less than d work units,
restricting any further speedup. Pursuing this analogy, in an optimum implementation where
d has been reached, the work expended is we = d np and the work accomplished is wa = A.
Therefore, ww = we - wa = d np - A. Using Eq. 10, the “idealized” s, denoted ŝ, is

$$ \hat{s} = \frac{d\,n_p - A}{A\,(n_p - 1)} = \frac{h\,n_p - 1}{n_p - 1} \qquad (19) $$

The preceding formulation is valid only when the number of processors used equals or exceeds
the breadth of the dependency graph, or when the number of processors needed to attain
the maximum speedup is reached. This reflects the fact that ŝ models only the asymptotic
behavior of s. The precise relationship between ŝ and s is obtained by using Eq. 8 and
writing s in terms of speedup: s = [(np/S) - 1]/(np - 1). Then

$$ \hat{s} = s + \frac{n_p}{n_p - 1}\left(h - \frac{1}{S}\right) \qquad (20) $$

Once the maximum speedup is obtained for a given implementation, S = A/d = 1/h. It
follows from Eq. 20 that s = ŝ, so

$$ \lim_{n_p \to \infty} s = \lim_{n_p \to \infty} \hat{s} = h \qquad (21) $$

*This fraction is conceptually different from Amdahl’s s in that it represents a portion of work that can
be executed in parallel with other tasks. It represents work that must be performed sequentially, not serially.


Based on Eq. 21, the limiting behavior of s actually defines a number which can be interpreted
as the largest irreducible fraction of a task that must be executed in a sequential
(not serial) fashion. Furthermore, based on the assumption that $\lim_{n_p \to \infty} s = s_\infty$ exists, the
following results can be established:

$$ \text{(i)}\ \lim_{n_p \to \infty} S = \frac{1}{s_\infty} \qquad\qquad \text{(ii)}\ \lim_{n_p \to \infty} s' = 1 \qquad (22) $$

To establish (i), based on Eq. 8,

$$ \left| \frac{1}{S} - s \right| = \left| s + \frac{1 - s}{n_p} - s \right| = \frac{1 - s}{n_p} < \frac{1}{n_p} \qquad (23) $$

Therefore, 1/S - s → 0 as np → ∞. To establish (ii), by Eq. 14, s' = Ss, so

$$ \lim_{n_p \to \infty} s' = \left( \lim_{n_p \to \infty} S \right) \left( \lim_{n_p \to \infty} s \right) = \frac{1}{s_\infty}\, s_\infty = 1. $$

One can also derive “idealized” versions of s', p, and p' using the preceding type of reasoning.
For example, using Eqs. 9 and 11 one derives the idealized s', ŝ' = (s∞ np - 1)/[s∞ (np - 1)],
so that the ratio of the idealized s' to the idealized s gives S = 1/s∞ (for sufficiently large np).
These idealized versions of the basic parameters constitute
an asymptotic model of parallel performance. As indicated by Eqs. 19 and 20, the actual
versions of the parameters differ from the parameters in the asymptotic model by error terms
which approach zero as np → ∞ (provided $\lim_{n_p \to \infty} s = s_\infty$ exists).
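
The size of the error term can be illustrated for Example 1. This is a minimal sketch, assuming h = d/A = 4/9, the idealized form (h np - 1)/(np - 1) that follows from ww = d np - A, and the actual s written in terms of speedup as (np/S - 1)/(np - 1). The speedups are the Example 1 values (3/2 at np = 2, taken as s'/s from Table 3, and 9/4 from np = 3 on); the function names are illustrative.

```python
from fractions import Fraction

H = Fraction(4, 9)  # h = d / A for Example 1

# Speedups of Example 1 (S = 9/4 is the maximum, first reached at n_p = 3).
speedup = {2: Fraction(3, 2), 3: Fraction(9, 4), 4: Fraction(9, 4), 5: Fraction(9, 4)}


def s_ideal(n_p, h=H):
    """Idealized serial fraction: (h n_p - 1) / (n_p - 1)."""
    return (h * n_p - 1) / (n_p - 1)


def s_actual(n_p, S):
    """Serial fraction written in terms of speedup: (n_p / S - 1) / (n_p - 1)."""
    return (n_p / S - 1) / (n_p - 1)


def error_term(n_p, S, h=H):
    """Difference s_ideal - s_actual, equal to n_p / (n_p - 1) * (h - 1 / S)."""
    return Fraction(n_p, n_p - 1) * (h - 1 / S)


for n_p, S in speedup.items():
    print(n_p, s_ideal(n_p), s_actual(n_p, S), error_term(n_p, S))
```

For np = 2 the idealized value is negative, which reflects that the idealized form does not apply below the point of maximum speedup; from np = 3 on the error term vanishes and the two values coincide.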

The model predicts that for a given problem size, even though s changes with np, it
asymptotically approaches a maximum which in turn determines the speedup that can be
attained. This is in accordance with Amdahl’s law. However, the model also says that
unless these fundamental parameters are treated as dynamic entities, Amdahl’s formulation
of speedup is only correct for np = ∞. The argument can be made that, for the a priori
definitions of the fundamental parameters (i.e., s, p, etc.), the formulations of speedup
expressed in Ref. 1 and Ref. 2 are correct. However, those definitions impose the overly restrictive
assumption that the work represented by p is equally partitioned among all processors. If,
however, the new interpretations presented herein are given to the fundamental parameters,
they can serve as parameters in a general model of parallel performance.
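
The difference between a constant and a dynamic s can be made concrete with Example 1. The sketch below is a minimal illustration, assuming the Amdahl-type form S = 1/(s + (1 - s)/np), the limiting value 4/9 for s, and the closed form s = (4np/9 - 1)/(np - 1) stated for np ≥ 3. The function names are illustrative.

```python
S_INF = 4 / 9          # limiting serial fraction of Example 1
S_MAX = 9 / 4          # speedup attained by Example 1 for n_p >= 3


def amdahl_speedup(n_p, s=S_INF):
    """Amdahl-type speedup; by default s is frozen at its limiting value."""
    return 1.0 / (s + (1.0 - s) / n_p)


def dynamic_s(n_p):
    """Closed form of s for Example 1, stated for n_p >= 3."""
    return (4 * n_p / 9 - 1) / (n_p - 1)


for n_p in (3, 5, 10, 100, 1000):
    frozen = amdahl_speedup(n_p)
    dynamic = amdahl_speedup(n_p, dynamic_s(n_p))
    print(f"n_p={n_p:5d}  constant s: {frozen:.3f}   dynamic s: {dynamic:.3f}")
```

With s frozen at 4/9 the predicted speedup stays below 9/4 for every finite np and only approaches it in the limit, whereas the dynamic s gives 9/4 for each np in the loop.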

3.5  General  Model  of  Parallel  Performance 

This section presents a group of assumptions about the fundamental parameters, together with
their consequences, which collectively represent a dynamic model of parallel performance. There is
one primary assumption, A1, and two secondary assumptions, a1 and a2.

(A1) The definitions of speedup and efficiency presented in this paper are equivalent. That
is, speedup and efficiency are absolute, not relative entities.

A partial justification for this assumption comes from the observation that each of the three
definitions of speedup (Eqs. 6, 8, and 9) was motivated by the original concept of speedup as
the ratio of the execution times required to perform the task in serial and parallel modes (Eq.
1). The reconciliation of the different views of speedup based on (A1) led to the following
consequences (Eqs. 10, 11, and 14):

(C1)
$$ s = \frac{w_w}{w_a} \cdot \frac{1}{n_p - 1}, \qquad s' = \frac{w_w}{w_e} \cdot \frac{n_p}{n_p - 1} $$

(C2)
$$ \text{Speedup} = \begin{cases} s'/s & \text{if } w_w > 0 \\ n_p & \text{if } w_w = 0 \end{cases} \qquad \text{Efficiency} = \begin{cases} p'/p & \text{if } w_w > 0 \\ 1 & \text{if } w_w = 0 \end{cases} $$

The empirical and theoretical evidence presented in Section 3.3 suggests possible behavior for
the parameters s and s' that will be incorporated into the model as the following secondary
assumptions:

(a1) s is an increasing function of np (for constant problem size n and np sufficiently large).

(a2) s is a decreasing function of n (for a constant number of processors np and sufficiently
large n).

Using reasoning similar to the type presented in Section 3.1, one can show that (A1) and
(a1) imply the following consequences:

(C3) $\lim_{n_p \to \infty} s = s_\infty$ exists and $\lim_{n_p \to \infty} S = 1/s_\infty$.

(C4) s' is an increasing function of np and $\lim_{n_p \to \infty} s' = 1$.

(C5) E is a decreasing function of np and $\lim_{n_p \to \infty} E = 0$.

Since s is an increasing function of np and s < 1, (C3) follows. To establish (C4), by Eq. 17
one can write

$$ s' = \frac{1}{1 + (1/s - 1)(1/n_p)} = \frac{1}{1 + r} $$

where r = (1/s - 1)(1/np). By (a1), as np increases, s increases, so (1/s - 1) decreases. Since
1/np also decreases, r decreases, so s' increases. By (C3), s∞ exists and is nonzero, so
$\lim_{n_p \to \infty} r = 0$, which implies $\lim_{n_p \to \infty} s' = 1$ and establishes (C4).

Figure 10. Relationship between s and s'; np = 10.

To establish (C5), write the definition of efficiency in the form

$$ E = \frac{S}{n_p} = \frac{1}{n_p\,[s + (1 - s)/n_p]} = \frac{1}{s(n_p - 1) + 1} $$

By (a1), as np increases, s increases, so s(np - 1) + 1 increases, which implies that E decreases.
By (C3), s∞ exists and is nonzero, so $\lim_{n_p \to \infty} E = 0$.
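
Consequences (C4) and (C5) can be observed numerically for any increasing s. The following is a minimal sketch, assuming the form s' = 1/(1 + r) with r = (1/s - 1)/np and the efficiency form E = 1/(s(np - 1) + 1). The function `s_example` is a hypothetical increasing s(np) with limit 0.2, chosen only for illustration and not taken from the report's data.

```python
def s_prime(s, n_p):
    """s' = 1 / (1 + r), with r = (1/s - 1) / n_p."""
    r = (1.0 / s - 1.0) / n_p
    return 1.0 / (1.0 + r)


def efficiency(s, n_p):
    """E = 1 / (s (n_p - 1) + 1)."""
    return 1.0 / (s * (n_p - 1) + 1.0)


def s_example(n_p, s_inf=0.2):
    """Hypothetical increasing s(n_p) with limit s_inf (an assumption, not data)."""
    return s_inf * (1.0 - 1.0 / n_p)


procs = [2, 4, 8, 16, 64, 256, 1024]
s_vals = [s_example(n) for n in procs]
sp_vals = [s_prime(s, n) for s, n in zip(s_vals, procs)]
e_vals = [efficiency(s, n) for s, n in zip(s_vals, procs)]
for n, s, sp, e in zip(procs, s_vals, sp_vals, e_vals):
    print(f"n_p={n:5d}  s={s:.4f}  s'={sp:.4f}  E={e:.4f}")
```

In this illustration s' rises toward 1 and E falls toward 0 as np grows, as (C4) and (C5) state.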

Assumption (a2) involves the parameter n, which is not explicit in any of the expressions that
have been presented. However, assuming a constant number of processors, one can state a
basic relationship between s' and s that can be used to derive consequences of (a2). As
previously noted, one can write s' in the following form:

$$ s' = \frac{s}{s(1 - 1/n_p) + 1/n_p} = \frac{s}{As + B} $$

where A = 1 - 1/np and B = 1/np. Viewing s and s' as real variables in the range (0, 1),
one computes ds'/ds = B/(As + B)^2 > 0 and d^2s'/ds^2 = -2AB/(As + B)^3 < 0. Then one
obtains the graphical representation of the relationship between s and s' found in Fig. 10,
where the average slope of the graph at a particular s (the slope s'/s of the line from the
origin to that point) represents speedup. Using Fig. 10
and (a2), one can derive the following consequences (for simplicity the parenthetical remarks
stated in (a2) are omitted).

(C6) s' is a decreasing function of n.

(C7) S and E are increasing functions of n.

Since ds'/ds > 0, it follows from (a2) that (C6) holds. Furthermore, since S is the average
slope of a concave downward graph (d^2s'/ds^2 < 0), it follows that as s decreases, S increases.
Therefore, by (a2), as n increases, S also increases. Since E = S/np and np is fixed, it follows
that E increases as n increases, so (C7) is established.
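
The shape argument behind (C6) and (C7) can be checked for np = 10, the value used in Fig. 10. This is a minimal sketch, assuming s' = s/(As + B) with A = 1 - 1/np and B = 1/np and the derivative expressions quoted in the text. The grid of s values and the function names are illustrative.

```python
N_P = 10                      # value used in Fig. 10
A = 1.0 - 1.0 / N_P
B = 1.0 / N_P


def s_prime(s):
    """s' = s / (A s + B)."""
    return s / (A * s + B)


def ds_prime(s):
    """First derivative: B / (A s + B)^2, positive on (0, 1)."""
    return B / (A * s + B) ** 2


def d2s_prime(s):
    """Second derivative: -2 A B / (A s + B)^3, negative on (0, 1)."""
    return -2.0 * A * B / (A * s + B) ** 3


def chord_slope(s):
    """Average slope s'/s of the curve measured from the origin, i.e. the speedup S."""
    return s_prime(s) / s


grid = [k / 20 for k in range(1, 20)]          # s = 0.05, ..., 0.95
slopes = [chord_slope(s) for s in grid]
for s, S in zip(grid, slopes):
    print(f"s={s:.2f}  s'={s_prime(s):.3f}  S={S:.3f}")
```

Reading the printed rows from bottom to top shows S growing as s shrinks, which is the step used to pass from (a2) to (C7).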

The  assumptions  and  consequences  presented  above  can  be  interpreted  in  terms  of  graphical 
diagrams  like  the  ones  presented  in  Subsection  2.2.2.  This  interpretation  provides  an  intuitive 
explanation  of  the  model  in  terms  of  the  behavior  of  the  work  quantities  wa,  ww,  and  we. 

As usual, in Fig. 11, the shaded area represents work wasted, ww, and the white areas
represent the work accomplished, wa. Diagram 1 illustrates the model for increasing np
and diagram 2 illustrates the model for increasing n. For sample interpretations, note in
diagram 1 that the ratio of the width of the shaded area, ww/(np - 1), to the total area, we,
is increasing and approaches 1 as np increases. This is equivalent to saying that s' increases
and approaches 1 as np increases. Similarly, one can see that as np increases, the efficiency,
which is the ratio of the white area to the total area, we, is decreasing and approaching 0.
On the other hand, based on diagram 2, one sees that s' (which is the ratio of the width of
the shaded area to the total area) decreases as n increases, while the efficiency increases.

In conclusion, the dynamics of the fundamental parameters can be shown to provide a type
of summary of the behavior of the parallel algorithm. Figure 12 contains the (approximate)
level curves for the S, E, s, and s' parameters for the Cholesky algorithm implementation
discussed in Section 3. Figure 12b shows that constant efficiency can be maintained by a
linear choice of pairs (n, np) as n, np → ∞. This choice of pairs will result in increasingly
larger speedups, since each efficiency curve intersects the hyperbolic speedup curves in
infinitely many points. Moreover, in Fig. 12a each line representing a constant value of np is a
horizontal asymptote for the speedup curve given by S = np. Similarly, in Fig. 12c the lines
representing constant n values are vertical asymptotes for the constant s curves and constant
speedup curves. In Fig. 12d, a constant s' is also maintained by a linear combination of n
and np. This statement is consistent with the findings reported in Ref. 2 that a constant s'
value (and hence a linear speedup) can be obtained for certain problems by allowing n to
increase linearly with np. Finally, the similarity of the level curves in Fig. 12b and 12d can
be explained by noting that, as the number of processors increases, the difference between
the efficiency E and the parallel fraction p' approaches zero [$\lim_{n_p \to \infty}(E - p') = 0$].

Figure 11. Boxed representations of parallel tasks. Diagram One: Fixed Problem Size.

4.  Conclusions

The widespread practice of measuring parallel performance solely in terms of constant entities
s, p, independent of n and np, leads to a weak model which has very little predictive or
descriptive value. Historically, this practice originated in the framework of vector processing,
where the vectorizable portion of a loop constitutes a portion of work that can be executed
with 100 percent efficiency, while the unvectorized portion constitutes work that must be
performed serially (one operation at a time). This same reasoning, however, does not apply
to parallel processing since, in general, tasks are composed of streams of operations. In
this context, it is the sequential nature of the operations in a given stream that bounds
the speedup attainable by an algorithm. Because independent streams may have different
lengths, partitioning these for parallel execution is not as simple as dividing them equally
among available processors (i.e., p/np). The incongruent packing of these streams yields idle
periods which contribute to the inefficiency of the system and result in the waiting behavior
which characterizes the algorithm’s performance.

Therefore, the work parameters ww, wa, and we introduced in this report are useful because
they provide a natural means for interpreting the performance of a parallel algorithm for
various problem and ensemble sizes in terms of its waiting behavior. These parameters also
provide a connecting link between the overly simplistic “serial” and “parallel” fraction
parameters and the very detailed timing statistics obtained empirically. Moreover, using these
parameters to describe the equivalence of the alternate formulations of speedup introduces a
new dynamic view of the “serial” and “parallel” fractions as variables depending on n and np.
This new point of view permits the formulation of more sophisticated parallel performance
models which accurately reflect both theoretical results and empirical observations.

4.1  Future Research

Plans for future work include three aspects. The first aspect involves the use of the new
speedup and efficiency definitions to characterize parallel algorithms. The second aspect
involves extending the formulations of speedup by introducing time as a variable and by
replacing the “serial” and “parallel” fraction pairs (s, p) or (s', p') with probability
distributions based on the number of processors. This will permit the fundamental
parameters to be treated from a statistical point of view. The third aspect involves extending
the models to include properties of scaled speedup. The concept of scaled speedup
(introduced in Ref. 2 and Ref. 3) is important because of its connection to the behavior of
speedup on various curves in the (np, n) plane.

Appendix A.  Dynamics of Fundamental Parameters


The following tables illustrate the behavior of the fundamental parameters for various
problem sizes and ensemble sizes. They were computed based on tables from several sources as
specified in the table captions.

Table A.1. Speedup, s, and s' - Example 1. Table compiled on the
basis of Table 2, Section 5.5, p. 14 in Ref. 2.

                            Number of Processors
Problem Size       4          16         64         256        1024

Speedup
9 x 2^12         3.98       15.86      62.00      226.395    639
9 x 2^10         3.96       15.52      57.036     167.225
9 x 2^8          3.87       14.32      44.3
9 x 2^6          3.60       11.07
9 x 2^4          2.96

s Values
9 x 2^12         0.00115    0.0006     0.00051    0.00052    0.00059
9 x 2^10         0.00301    0.0020     0.00194    0.00208
9 x 2^8          0.01067    0.0077     0.00759
9 x 2^6          0.03663    0.02967
9 x 2^4          0.11708

s' Values
9 x 2^12         0.00457    0.00946    0.03161    0.11688    0.37634
9 x 2^10         0.01192    0.03201    0.11034    0.34814
9 x 2^8          0.04136    0.11163    0.32876
9 x 2^6          0.13201    0.32852
9 x 2^4          0.34659

Table A.2. Speedup, s, and s' - Example 2. Table compiled on the
basis of Table 5, Section 6.5, p. 23 in Ref. 2.

                            Number of Processors
Problem Size       4          16         64         256        1024

Speedup
2 x 2^12         3.959      15.473     58.534     201.630    519.150
2 x 2^10         3.908      14.812     51.260     132.259
2 x 2^8          3.780      13.136     34.091
2 x 2^6          3.472      8.990
2 x 2^4          2.578

s Values
2 x 2^12         0.00341    0.00227    0.00148    0.00106    0.00093
2 x 2^10         0.00781    0.00535    0.00395    0.00367
2 x 2^8          0.01942    0.01454    0.01393
2 x 2^6          0.05069    0.05198
2 x 2^4          0.18384

s' Values
2 x 2^12         0.01352    0.03512    0.08676    0.21322    0.49350
2 x 2^10         0.03053    0.07920    0.20223    0.48426
2 x 2^8          0.07339    0.19095    0.47475
2 x 2^6          0.17600    0.46731
2 x 2^4          0.47396

Table A.3. Speedup, s, and s' - Example 3. Table compiled on the
basis of Table 8, p. 29 in Ref. 2.

                            Number of Processors
Problem Size       4          16         64         256        1024

Speedup
2 x 2^11         3.954      15.462     57.464     177.491    351.241
2 x 2^9          3.925      14.781     47.182     98.542
2 x 2^7          3.805      12.599     28.088
2 x 2^5          3.437      8.182
2 x 2^3          2.561

s Values
2 x 2^11         0.00387    0.00232    0.00181    0.00173    0.00187
2 x 2^9          0.00634    0.00550    0.00566    0.00627
2 x 2^7          0.01710    0.01800    0.02029
2 x 2^5          0.05460    0.06370
2 x 2^3          0.18730

s' Values
2 x 2^11         0.01529    0.03588    0.10375    0.30788    0.65763
2 x 2^9          0.02490    0.06123    0.26695    0.61748
2 x 2^7          0.06507    0.22674    0.57003
2 x 2^5          0.18765    0.52121
2 x 2^3          0.47967

Table A.4. Speedup, s, and s' - Example 4. Table compiled on the
basis of Table 1, p. 8, Example 1 in Ref. 3.

                            Number of Processors
Problem Size       4          16         64         256        1024

Speedup
2 x 2^12         2.606      10.007     32.468     61.040     64.936
2 x 2^10         2.580      8.849      19.513     21.139
2 x 2^8          2.468      6.129      7.600
2 x 2^6          2.087      2.824
2 x 2^4          1.333

s Values
2 x 2^12         0.17824    0.03993    0.01542    0.01233    0.01444
2 x 2^10         0.18353    0.05388    0.03619    0.04357
2 x 2^8          0.20702    0.10737    0.11779
2 x 2^6          0.30556    0.31111
2 x 2^4          0.66667

s' Values
2 x 2^12         0.46456    0.39956    0.50051    0.76455    0.93750
2 x 2^10         0.47365    0.47674    0.70615    0.92102
2 x 2^8          0.51082    0.65806    0.89524
2 x 2^6          0.63768    0.87843
2 x 2^4          0.88889

Table A.5. Speedup, s, and s' - Example 5. Table compiled on the
basis of Table 1, p. 8, Example 2 in Ref. 3.

                            Number of Processors
Problem Size       4          16         64         256        1024

Speedup
2 x 2^12         3.940      15.128     49.085     92.280     98.170
2 x 2^10         3.902      13.384     29.513     31.972
2 x 2^8          3.727      9.258      11.480
2 x 2^6          3.130      4.235
2 x 2^4          2.111

s Values
2 x 2^12         0.00506    0.00384    0.00482    0.00696    0.00922
2 x 2^10         0.00840    0.01303    0.01855    0.02748
2 x 2^8          0.02439    0.04855    0.07262
2 x 2^6          0.09259    0.18519
2 x 2^4          0.29825

s' Values
2 x 2^12         0.01993    0.05814    0.23674    0.64204    0.90501
2 x 2^10         0.03227    0.17447    0.54742    0.87854
2 x 2^8          0.09091    0.44946    0.83365
2 x 2^6          0.28986    0.78631
2 x 2^4          0.62963